What is Craigslist?
“Craigslist” is a renowned name in every household. It is an American-based advertising platform for sections like employment, housing, for sale, items wanted, services, community service, gigs, resumes, and discussion forums. In 1996, it changed to a web-based service and added more classified categories. In 2000, it began to spread to additional American and Canadian cities, and it presently includes 70 nations.
We will be focusing on the vehicle-selling segment of Craigslist. Being one of the top-rated, convenient, and user-friendly platforms where one can buy and sell their vehicles, it can help the sellers to gain a reasonable price. It consists of a huge repository with relevant data and particulars of all used vehicles at a go in one location.
Vehicles are preferable in the USA over public transport as the use of vehicles is comparatively feasible and practical. We have chosen Craigslist as our topic since it is the top reputed platform for buying and selling vehicles.
The main objective of Craigslist as a project is to analyze the data and find insights to help craigslist find answers to the below-mentioned questions:
Which states have the highest number of vehicle listings?
What brands have the most expensive and least expensive vehicles among the listed vehicles??
What is the percentage of the listed vehicle as per the condition?
Which are the most important variables in price prediction?
Predict the best price to buy a used vehicle?
How the pricing problem will be solved?
We will be using supervised learning and dividing the data into training and testing sets. Based on training data, we will be predicting the “price” of every vehicle in the testing data set.
To predict our dependent(target) variable “price”, we will be using independent variables like manufacture year, model of car, car’s condition, craigslist region, etc. It will consist of all the relevant information to help craigslist buy the cars at a reasonable yet profitable.
The data is in CSV file format and its size is 1.45 GB. It has a total 426880 number of observations and 26 Variables. Please find below the column names and descriptions:
| Variables | Defination |
|---|---|
| id | Unique id for every listed car |
| url | listing url |
| region | craigslist region |
| region_url | URL for the region |
| year | manufacture year |
| manufacturer | manufacturer of vehicle |
| model | model of vehicle |
| condition | condition of vehicle |
| cylinders | number of cylinders |
| fuel | fuel type of vehicle |
| odometer | miles traveled by vehicle |
| title_status | status of vehicle |
| transmission | transmission of vehicle |
| VIN | vehicle identification number |
| drive | type of drive |
| size | size of vehicle |
| type | generic type of vehicle |
| paint_color | color of vehicle |
| image_url | image URL |
| description | listed description of vehicle |
| county | useless column left in by mistake |
| state | state of listing |
| lat | latitude of listing |
| long | longitude of listing |
| posting_date | posting date of car |
| price | price of car |
library(plotly)
library(dplyr)
library(tidyverse)
library(olsrr)
library(caTools)
library(xgboost)
Craigslist_data = read.csv("Vehicles.csv")
Craigslist_data = Craigslist_data[1:100000,]
nrow(Craigslist_data)
## [1] 100000
ncol(Craigslist_data)
## [1] 26
summary(Craigslist_data)
## id url region region_url
## Min. :7.209e+09 Length:100000 Length:100000 Length:100000
## 1st Qu.:7.309e+09 Class :character Class :character Class :character
## Median :7.314e+09 Mode :character Mode :character Mode :character
## Mean :7.312e+09
## 3rd Qu.:7.316e+09
## Max. :7.317e+09
##
## price year manufacturer model
## Min. :0.000e+00 Min. :1900 Length:100000 Length:100000
## 1st Qu.:5.999e+03 1st Qu.:2008 Class :character Class :character
## Median :1.400e+04 Median :2013 Mode :character Mode :character
## Mean :1.111e+05 Mean :2011
## 3rd Qu.:2.599e+04 3rd Qu.:2017
## Max. :3.025e+09 Max. :2022
## NA's :428
## condition cylinders fuel odometer
## Length:100000 Length:100000 Length:100000 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 36657
## Mode :character Mode :character Mode :character Median : 82543
## Mean : 96450
## 3rd Qu.: 130000
## Max. :10000000
## NA's :1521
## title_status transmission VIN drive
## Length:100000 Length:100000 Length:100000 Length:100000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## size type paint_color image_url
## Length:100000 Length:100000 Length:100000 Length:100000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## description county state lat
## Length:100000 Mode:logical Length:100000 Min. :-84.12
## Class :character NA's:100000 Class :character 1st Qu.: 33.78
## Mode :character Mode :character Median : 36.40
## Mean : 36.79
## 3rd Qu.: 38.89
## Max. : 82.39
## NA's :539
## long posting_date
## Min. :-158.60 Length:100000
## 1st Qu.:-120.70 Class :character
## Median :-116.36 Mode :character
## Mean :-107.62
## 3rd Qu.: -88.27
## Max. : 173.89
## NA's :539
head(Craigslist_data,5)
## id
## 1 7222695916
## 2 7218891961
## 3 7221797935
## 4 7222270760
## 5 7210384030
## url
## 1 https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html
## 2 https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html
## 3 https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html
## 4 https://worcester.craigslist.org/cto/d/west-brookfield-2002-honda-odyssey-ex/7222270760.html
## 5 https://greensboro.craigslist.org/cto/d/trinity-1965-chevrolet-truck/7210384030.html
## region region_url price year
## 1 prescott https://prescott.craigslist.org 6000 NA
## 2 fayetteville https://fayar.craigslist.org 11900 NA
## 3 florida keys https://keys.craigslist.org 21000 NA
## 4 worcester / central MA https://worcester.craigslist.org 1500 NA
## 5 greensboro https://greensboro.craigslist.org 4900 NA
## manufacturer model condition cylinders fuel odometer title_status
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## transmission VIN drive size type paint_color image_url description county
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## state lat long posting_date
## 1 az NA NA
## 2 ar NA NA
## 3 fl NA NA
## 4 ma NA NA
## 5 nc NA NA
colnames(Craigslist_data)
## [1] "id" "url" "region" "region_url" "price"
## [6] "year" "manufacturer" "model" "condition" "cylinders"
## [11] "fuel" "odometer" "title_status" "transmission" "VIN"
## [16] "drive" "size" "type" "paint_color" "image_url"
## [21] "description" "county" "state" "lat" "long"
## [26] "posting_date"
str(Craigslist_data)
## 'data.frame': 100000 obs. of 26 variables:
## $ id : num 7.22e+09 7.22e+09 7.22e+09 7.22e+09 7.21e+09 ...
## $ url : chr "https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html" "https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html" "https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html" "https://worcester.craigslist.org/cto/d/west-brookfield-2002-honda-odyssey-ex/7222270760.html" ...
## $ region : chr "prescott" "fayetteville" "florida keys" "worcester / central MA" ...
## $ region_url : chr "https://prescott.craigslist.org" "https://fayar.craigslist.org" "https://keys.craigslist.org" "https://worcester.craigslist.org" ...
## $ price : num 6000 11900 21000 1500 4900 ...
## $ year : int NA NA NA NA NA NA NA NA NA NA ...
## $ manufacturer: chr "" "" "" "" ...
## $ model : chr "" "" "" "" ...
## $ condition : chr "" "" "" "" ...
## $ cylinders : chr "" "" "" "" ...
## $ fuel : chr "" "" "" "" ...
## $ odometer : int NA NA NA NA NA NA NA NA NA NA ...
## $ title_status: chr "" "" "" "" ...
## $ transmission: chr "" "" "" "" ...
## $ VIN : chr "" "" "" "" ...
## $ drive : chr "" "" "" "" ...
## $ size : chr "" "" "" "" ...
## $ type : chr "" "" "" "" ...
## $ paint_color : chr "" "" "" "" ...
## $ image_url : chr "" "" "" "" ...
## $ description : chr "" "" "" "" ...
## $ county : logi NA NA NA NA NA NA ...
## $ state : chr "az" "ar" "fl" "ma" ...
## $ lat : num NA NA NA NA NA NA NA NA NA NA ...
## $ long : num NA NA NA NA NA NA NA NA NA NA ...
## $ posting_date: chr "" "" "" "" ...
Craigslist_data = subset(Craigslist_data,select = -c(1,2,4,15,20,21,26))
colnames(Craigslist_data)
## [1] "region" "price" "year" "manufacturer" "model"
## [6] "condition" "cylinders" "fuel" "odometer" "title_status"
## [11] "transmission" "drive" "size" "type" "paint_color"
## [16] "county" "state" "lat" "long"
id, url,
region_url, VIN, image_url,
description, posting_date.colSums(is.na(Craigslist_data))
## region price year manufacturer model condition
## 0 0 428 4420 1284 38289
## cylinders fuel odometer title_status transmission drive
## 40159 603 1521 2279 514 30474
## size type paint_color county state lat
## 72154 19611 29776 100000 0 539
## long
## 539
colMeans(is.na(Craigslist_data))
## region price year manufacturer model condition
## 0.00000 0.00000 0.00428 0.04420 0.01284 0.38289
## cylinders fuel odometer title_status transmission drive
## 0.40159 0.00603 0.01521 0.02279 0.00514 0.30474
## size type paint_color county state lat
## 0.72154 0.19611 0.29776 1.00000 0.00000 0.00539
## long
## 0.00539
County has 100% NA values. Removing the column.Craigslist_data = subset(Craigslist_data,select = -c(16))
colnames(Craigslist_data)
## [1] "region" "price" "year" "manufacturer" "model"
## [6] "condition" "cylinders" "fuel" "odometer" "title_status"
## [11] "transmission" "drive" "size" "type" "paint_color"
## [16] "state" "lat" "long"
Craigslist_data = na.omit(Craigslist_data)
colSums(is.na(Craigslist_data))
## region price year manufacturer model condition
## 0 0 0 0 0 0
## cylinders fuel odometer title_status transmission drive
## 0 0 0 0 0 0
## size type paint_color state lat long
## 0 0 0 0 0 0
nrow(Craigslist_data)
## [1] 17571
boxplot(Craigslist_data$price)
range(Craigslist_data$price)
## [1] 0 1111111111
Q1 <- quantile(Craigslist_data$price, .25)
Q3 <- quantile(Craigslist_data$price, .75)
IQR <- IQR(Craigslist_data$price)
Craigslist_data <- subset(Craigslist_data, Craigslist_data$price> (Q1 - 1.5*IQR) & Craigslist_data$price< (Q3 + 1.5*IQR))
range(Craigslist_data$price)
## [1] 0 36250
nrow(Craigslist_data)
## [1] 16529
data0 = Craigslist_data %>% count(state, name = "Count")
data0
## state Count
## 1 ak 516
## 2 al 809
## 3 ar 642
## 4 az 1537
## 5 ca 7945
## 6 co 1736
## 7 ct 929
## 8 dc 469
## 9 de 215
## 10 fl 1731
plot_ly(
data = data0,
x = data0$Count,
y = data0$state,
name = "SF Zoo",
type = "bar", marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))
) %>%
layout(yaxis = list(categoryorder = "total ascending", title = 'State'),title = 'State-wise count of listed vehicles', xaxis = list(title = 'Count'))
data1 = Craigslist_data %>% count(condition, name = "Count")
data1
## condition Count
## 1 excellent 8727
## 2 fair 666
## 3 good 4866
## 4 like new 2130
## 5 new 74
## 6 salvage 66
Percentage= (data1$Count/sum(data1$Count))*100
data1 = mutate(data1, Percentage = Percentage)
data1
## condition Count Percentage
## 1 excellent 8727 52.7981124
## 2 fair 666 4.0292819
## 3 good 4866 29.4391675
## 4 like new 2130 12.8864420
## 5 new 74 0.4476980
## 6 salvage 66 0.3992982
Condition = data1$condition
Percentage = data1$Percentage
pie1 = ggplot(data = data1, aes(x="", y = Percentage, fill = Condition)) +
geom_col(color = "White") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(round(Percentage,2), "%")),
position = position_stack(vjust = 0.6), size = 3) +
theme(panel.background = element_blank(),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, size = 18)) +
ggtitle("Percentage of vehicles as per the Condition")
pie1
data2 = Craigslist_data %>%
group_by(manufacturer) %>%
summarise(max_price=max(price))
data2 = data2[order(-data2$max_price), ]
data2
## # A tibble: 41 × 2
## manufacturer max_price
## <chr> <dbl>
## 1 gmc 36250
## 2 ford 36000
## 3 jeep 36000
## 4 mitsubishi 36000
## 5 pontiac 36000
## 6 ram 36000
## 7 toyota 36000
## 8 chevrolet 35999
## 9 dodge 35950
## 10 bmw 35900
## # … with 31 more rows
data2$manufacturer <- factor(data2$manufacturer,
levels = data2$manufacturer[order(data2$max_price, decreasing = TRUE)])
p <- ggplot(data=data2, aes(x= manufacturer, y=max_price, fill= manufacturer)) +
geom_bar(stat="identity")+ ggtitle("Maximum price vehicle for each manufacturer")
fig <- ggplotly(p)
fig
data3 = Craigslist_data %>%
group_by(manufacturer) %>%
summarise(min_price=min(price))
data3 = data3[order(data3$min_price), ]
data3
## # A tibble: 41 × 2
## manufacturer min_price
## <chr> <dbl>
## 1 acura 0
## 2 audi 0
## 3 bmw 0
## 4 buick 0
## 5 cadillac 0
## 6 chevrolet 0
## 7 chrysler 0
## 8 dodge 0
## 9 ferrari 0
## 10 fiat 0
## # … with 31 more rows
data3$manufacturer <- factor(data3$manufacturer,
levels = data3$manufacturer[order(data3$min_price, decreasing = FALSE)])
p <- ggplot(data=data3, aes(x= manufacturer, y=min_price, fill= manufacturer)) +
geom_bar(stat="identity")+ ggtitle("Minimum price vehicle for each manufacturer")
fig <- ggplotly(p)
fig
Craigslist_data$region = as.factor(Craigslist_data$region)
Craigslist_data$year = as.factor(Craigslist_data$year)
Craigslist_data$manufacturer = as.factor(Craigslist_data$manufacturer)
Craigslist_data$model = as.factor(Craigslist_data$model)
Craigslist_data$condition = as.factor(Craigslist_data$condition)
Craigslist_data$cylinders = as.factor(Craigslist_data$cylinders)
Craigslist_data$fuel = as.factor(Craigslist_data$fuel)
Craigslist_data$title_status = as.factor(Craigslist_data$title_status)
Craigslist_data$transmission = as.factor(Craigslist_data$transmission)
Craigslist_data$drive = as.factor(Craigslist_data$drive)
Craigslist_data$size = as.factor(Craigslist_data$size)
Craigslist_data$type = as.factor(Craigslist_data$type)
Craigslist_data$paint_color = as.factor(Craigslist_data$paint_color)
Craigslist_data$state = as.factor(Craigslist_data$state)
str(Craigslist_data)
## 'data.frame': 16529 obs. of 18 variables:
## $ region : Factor w/ 75 levels "anchorage / mat-su",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ price : num 15000 19900 14000 22500 15000 3000 9500 17500 0 0 ...
## $ year : Factor w/ 93 levels "1918","1923",..: 85 76 84 73 89 76 75 80 90 91 ...
## $ manufacturer: Factor w/ 41 levels "acura","alfa-romeo",..: 14 14 17 14 11 9 9 39 8 8 ...
## $ model : Factor w/ 3667 levels "-benz s430","(cng) 2500 express van",..: 1477 1587 2421 1625 852 3309 3309 1675 1439 1439 ...
## $ condition : Factor w/ 6 levels "excellent","fair",..: 1 3 1 3 1 3 1 3 4 4 ...
## $ cylinders : Factor w/ 8 levels "10 cylinders",..: 6 7 6 7 7 6 6 6 6 6 ...
## $ fuel : Factor w/ 5 levels "diesel","electric",..: 3 1 3 1 3 3 3 3 3 3 ...
## $ odometer : int 128000 88000 95000 144700 90000 176144 30376 201300 68472 69125 ...
## $ title_status: Factor w/ 6 levels "clean","lien",..: 1 1 1 1 5 1 1 1 1 1 ...
## $ transmission: Factor w/ 3 levels "automatic","manual",..: 1 1 1 2 1 1 1 2 1 1 ...
## $ drive : Factor w/ 3 levels "4wd","fwd","rwd": 3 1 2 3 3 2 2 1 3 3 ...
## $ size : Factor w/ 4 levels "compact","full-size",..: 2 2 2 2 3 3 3 2 2 2 ...
## $ type : Factor w/ 13 levels "bus","convertible",..: 11 8 5 11 9 5 5 6 12 12 ...
## $ paint_color : Factor w/ 12 levels "black","blue",..: 1 2 10 11 6 10 2 1 11 11 ...
## $ state : Factor w/ 10 levels "ak","al","ar",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ lat : num 32.6 32.5 32.6 32.6 32.8 ...
## $ long : num -85.5 -85.5 -85.5 -85.4 -85.8 ...
set.seed(1)
sample <- sample.split(Craigslist_data$price, SplitRatio = 0.7)
train <- subset(Craigslist_data, sample == TRUE)
test <- subset(Craigslist_data, sample == FALSE)
dim(train)
## [1] 11703 18
dim(test)
## [1] 4826 18
model1 <- lm(price~., data = train)
str(summary(model1))
## List of 11
## $ call : language lm(formula = price ~ ., data = train)
## $ terms :Classes 'terms', 'formula' language price ~ region + year + manufacturer + model + condition + cylinders + fuel + odometer + title_status + tran| __truncated__ ...
## .. ..- attr(*, "variables")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel, odometer, title_status, transmiss| __truncated__ ...
## .. ..- attr(*, "factors")= int [1:18, 1:17] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:18] "price" "region" "year" "manufacturer" ...
## .. .. .. ..$ : chr [1:17] "region" "year" "manufacturer" "model" ...
## .. ..- attr(*, "term.labels")= chr [1:17] "region" "year" "manufacturer" "model" ...
## .. ..- attr(*, "order")= int [1:17] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel, odometer, title_status, transmiss| __truncated__ ...
## .. ..- attr(*, "dataClasses")= Named chr [1:18] "numeric" "factor" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:18] "price" "region" "year" "manufacturer" ...
## $ residuals : Named num [1:11703] 2.68e+03 8.32e+03 1.38e+04 -4.11e-09 8.31e+03 ...
## ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
## $ coefficients : num [1:3310, 1:4] 23054 -13705 -4846 -6019 -5114 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:3354] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:3354] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## $ sigma : num 5128
## $ df : int [1:3] 3310 8393 3354
## $ r.squared : num 0.745
## $ adj.r.squared: num 0.644
## $ fstatistic : Named num [1:3] 7.4 3309 8393
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:3310, 1:3310] 6.028 -0.369 -0.245 -0.356 -0.262 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## .. ..$ : chr [1:3310] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## - attr(*, "class")= chr "summary.lm"
#Forward_sec = ols_step_forward_p(model1,penter=.05)
#Forward_sec
model,year,region. They have
3835-factor values, which means it added 3835 different columns.model,year,region.train2 = subset(train,select = -c(1,3,5))
model2 <- lm(price~., data = train2)
str(summary(model2))
## List of 11
## $ call : language lm(formula = price ~ ., data = train2)
## $ terms :Classes 'terms', 'formula' language price ~ manufacturer + condition + cylinders + fuel + odometer + title_status + transmission + drive + size | __truncated__
## .. ..- attr(*, "variables")= language list(price, manufacturer, condition, cylinders, fuel, odometer, title_status, transmission, drive, size, typ| __truncated__
## .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:15] "price" "manufacturer" "condition" "cylinders" ...
## .. .. .. ..$ : chr [1:14] "manufacturer" "condition" "cylinders" "fuel" ...
## .. ..- attr(*, "term.labels")= chr [1:14] "manufacturer" "condition" "cylinders" "fuel" ...
## .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(price, manufacturer, condition, cylinders, fuel, odometer, title_status, transmission, drive, size, typ| __truncated__
## .. ..- attr(*, "dataClasses")= Named chr [1:15] "numeric" "factor" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:15] "price" "manufacturer" "condition" "cylinders" ...
## $ residuals : Named num [1:11703] -4173 2152 3298 3329 -3222 ...
## ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
## $ coefficients : num [1:104, 1:4] 44124 11643 20036 1078 674 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:104] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
## $ sigma : num 7151
## $ df : int [1:3] 104 11599 104
## $ r.squared : num 0.314
## $ adj.r.squared: num 0.308
## $ fstatistic : Named num [1:3] 51.6 103 11599
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:104, 1:104] 0.35976 -0.01072 -0.01232 -0.00852 -0.00834 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
## .. ..$ : chr [1:104] "(Intercept)" "manufactureralfa-romeo" "manufactureraston-martin" "manufactureraudi" ...
## - attr(*, "class")= chr "summary.lm"
Forward_sec = ols_step_forward_p(model2,penter=.05)
Forward_sec
##
## Selection Summary
## -------------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 type 0.1141 0.1132 3288.6333 243862.7258 8096.8747
## 2 condition 0.1960 0.1949 1904.2575 242736.7956 7714.9536
## 3 state 0.2273 0.2256 1377.6022 242290.8460 7566.4493
## 4 drive 0.2456 0.2438 1068.8165 242013.2604 7476.6070
## 5 odometer 0.2604 0.2585 821.7322 241784.5834 7403.6006
## 6 transmission 0.2736 0.2717 599.3190 241576.7876 7337.5378
## 7 fuel 0.2847 0.2826 413.3343 241404.3735 7282.4453
## 8 manufacturer 0.2967 0.2922 212.9842 241287.0131 7233.7085
## 9 size 0.3019 0.2972 127.0926 241206.2409 7207.8709
## 10 paint_color 0.3084 0.3031 19.7023 241119.3374 7177.8090
## 11 cylinders 0.3117 0.3061 -35.4119 241076.0721 7162.4282
## 12 long 0.3136 0.3079 -65.1230 241046.1558 7152.9763
## 13 title_status 0.3144 0.3083 -75.8259 241043.3462 7150.6033
## -------------------------------------------------------------------------------------
lat
long, state.train3 = subset(train,select = -c(16,17,18))
model3 <- lm(price~., data = train3)
str(summary(model3))
## List of 11
## $ call : language lm(formula = price ~ ., data = train3)
## $ terms :Classes 'terms', 'formula' language price ~ region + year + manufacturer + model + condition + cylinders + fuel + odometer + title_status + tran| __truncated__ ...
## .. ..- attr(*, "variables")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel, odometer, title_status, transmiss| __truncated__
## .. ..- attr(*, "factors")= int [1:15, 1:14] 0 1 0 0 0 0 0 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:15] "price" "region" "year" "manufacturer" ...
## .. .. .. ..$ : chr [1:14] "region" "year" "manufacturer" "model" ...
## .. ..- attr(*, "term.labels")= chr [1:14] "region" "year" "manufacturer" "model" ...
## .. ..- attr(*, "order")= int [1:14] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(price, region, year, manufacturer, model, condition, cylinders, fuel, odometer, title_status, transmiss| __truncated__
## .. ..- attr(*, "dataClasses")= Named chr [1:15] "numeric" "factor" "factor" "factor" ...
## .. .. ..- attr(*, "names")= chr [1:15] "price" "region" "year" "manufacturer" ...
## $ residuals : Named num [1:11703] 2.67e+03 8.25e+03 1.40e+04 -3.94e-09 8.34e+03 ...
## ..- attr(*, "names")= chr [1:11703] "32" "60" "66" "74" ...
## $ coefficients : num [1:3308, 1:4] 15640 -10556 -3417 -2806 -2936 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:3343] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:3343] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## $ sigma : num 5129
## $ df : int [1:3] 3308 8395 3343
## $ r.squared : num 0.745
## $ adj.r.squared: num 0.644
## $ fstatistic : Named num [1:3] 7.41 3307 8395
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:3308, 1:3308] 5.22642 -0.00284 -0.0076 -0.00574 -0.0053 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## .. ..$ : chr [1:3308] "(Intercept)" "regionauburn" "regionbakersfield" "regionbirmingham" ...
## - attr(*, "class")= chr "summary.lm"
test3 = subset(test,select = -c(16,17,18))
modelResiduals <- as.data.frame(residuals(model3))
ggplot(modelResiduals, aes(residuals(model3))) +
geom_histogram(fill='deepskyblue', color='black')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
test3$year[which(!(test3$year %in% unique(train3$year)))] <- NA
test3$model[which(!(test3$model %in% unique(train3$model)))] <- NA
test3 = na.omit(test3)
preds = predict(model3, test3)
## Warning in predict.lm(model3, test3): prediction from a rank-deficient fit may
## be misleading
modelEval = cbind(test3$price, preds)
colnames(modelEval) = c('Actual', 'Predicted')
modelEval = as.data.frame(modelEval)
modelEval$Predicted = round(modelEval$Predicted)
head(modelEval)
## Actual Predicted
## 56 19900 10555
## 111 9500 -3651
## 139 4000 -1943
## 220 2500 -821
## 223 12500 14195
## 245 0 32661
mse1 = mean((modelEval$Actual - modelEval$Predicted)^2)
rmse1 = sqrt(mse1)
rmse1
## [1] 5589.309
price using XGBoost.train_x = data.matrix(train3[, -2])
train_y = train3[,2]
test_x = data.matrix(test3[, -2])
test_y = test3[, 2]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
model4 = xgb.train(data = xgb_train, max.depth = 3, nrounds = 700, verbose = 0)
preds2 = predict(model4, xgb_test)
modelEval2 <- cbind(test_y, preds2)
colnames(modelEval2) <- c('Actual', 'Predicted')
modelEval2 <- as.data.frame(modelEval2)
modelEval2$Predicted <- round(modelEval2$Predicted)
head(modelEval2)
## Actual Predicted
## 1 19900 18554
## 2 9500 7217
## 3 4000 3886
## 4 2500 2567
## 5 12500 11954
## 6 0 17950
mse2 = mean((test_y - preds2)^2)
rmse2 = sqrt(mse2)
rmse2
## [1] 4424.406
importance_matrix = xgb.importance(colnames(xgb_train), model = model4)
xgb.plot.importance(importance_matrix[1:14,])
year is the most important
variable followed by odometer.